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Abstract. We introduce and analyze a waiting time model for the accumulation of genetic 
changes. The continuous time conjunctive Bayesian network is defined by a partially ordered set 
of mutations and by the rate of fixation of each mutation. The partial order encodes constraints 
on the order in which mutations can fixate in the population, shedding light on the mutational 
pathways underlying the evolutionary process. We study a censored version of the model and 
derive equations for an EM algorithm to perform maximum likelihood estimation of the model 
parameters. We also show how to select the maximum likelihood poset. The model is applied to 
genetic data from different cancers and from drug resistant HIV samples, indicating implications 
for diagnosis and treatment. 



1. Introduction 



The genetic progression of cancer is characterized by the accumulation of mutations in onco- 
genes and in tumor suppressor genes. Recent studies have shown that during the somatic evo- 
lution of cancer mutations in over 100 human genes are selected for, suggesting their beneficial 



effect on the growth of the cancer cell (Sjoblom et al. 20061 



In HIV infection, the virus acquires mutations in CTL epitopes that interfere with the immune 
response. This evolutionary process is specific for the genetic makeup of the infected host. 



Recently, a total of 478 CTL escape mutations have been identified in the HIV genome (Brumme 



et al. 20071 



Under drug treatment, HIV develops mutations that confer resistance to the applied drugs. 
Eventually, this evolutionary escape leads to therapy failure. More than 50 drug resistance- 



associated mutations are known in three different HIV proteins (Johnson et al. 2006) 



These three evolutionary scenarios have in common that, for the population of individuals, 
several mutations are available which increase fitness. Adaption is therefore characterized by 
the accumulation of these beneficial mutations which are virtually non-reversible. 

In this paper, we introduce a statistical model for the accumulation of genetic changes. The 
continuous time conjunctive Bayesian network (CT-CBN) is a continuous time Markov chain 
model, defined by a partially ordered set (poset) of advantageous mutations, and the rate of 
fixation for each mutation. The partial order encodes constraints on the succession in which 
mutations can occur and fixate in the population. We assume that the fixation times follow 
independent exponential distributions. The exponential waiting process for a mutation starts 
only when all predecessor mutations of that mutation in the poset have already occurred. The 
order constraints and waiting times reveal important information on the underlying biological 
process with implications for diagnosis and treatment. 

The CT-CBN is a continuous time analogue of the discrete conjunctive Bayesian network (D- 
CBN) introduced by Beerenwinkel et al. (20061. The D-CBN was shown to have very desirable 
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statistical and algebraic properties ( Beerenwinkel et al. ). We argue that the continuous time 
CBN is the more natural model for the waiting process described above, and we explore the 
connection to the discrete CBN. A special case of the D-CBN, where the poset is a tree, is known 
as the oncogenetic or mutagenetic tree model (Desper et al. 1999; Beerenwinkel et al. 2005b|c 



It has been applied to the somatic evolution of cancer (Radmacher et al. 2001 



et al. 20051 and to the evolution of drug resistance in HIV (Beerenwinkel et al 



basic mutagenetic tree model has been extended to a mixture model (Beerenwinkel 



Rahnenfiihrer 



2005a). The 



et al., 2005b) 



and to account for longitudinal data (Beerenwinkel and Drton 2007). 

A related tree model by von Heydebreck et al. (2004) represents the genetic changes at the 
leaves of the tree and regards the interior vertices as hidden events. Several authors have con- 



sidered larger model classes, including general Bayesian networks (Simon et al. 2000; Deforche 



et al. 2006 1 and general Markov chain models on the state space of mutational patterns (Foulkes 



and DeGruttola 2003 Hjelm et al. 2006) . As compared to trees and posets, these models are 



more flexible in describing mutational pathways, but parameter estimation and model selection 
is considerably more difficult. In fact, the number of free parameters of these models is typically 
exponential in the number of mutations. By contrast, in the CT-CBN model, the number of 
free parameters equals the number of mutations. We demonstrate that parameter estimation 
and selection of an optimal poset can be performed efficiently for CT-CBNs. Thus, they provide 
an attractive framework for modeling the accumulation of mutations, especially if the number 
of mutations is moderate or large. 

We formally define the CT-CBN in the next Section 2 and derive some basic properties of the 
model. The CT-CBN is an example of a regular exponential family with closed form maximum 
likelihood estimates (MLEs). In Section 3, we make precise the relation between the CT-CBN 
and the D-CBN. Section 4 deals with a censored version of the CT-CBN which is most relevant 
for observed data. The censored model lacks a closed form expression for the MLE, but has a 
natural EM algorithm for approximating maximum likelihood estimates. We apply our methods 
in Section 5 to genetic data from cancer cells and from drug resistant HI viruses. We close with 
a discussion in Section 6. 



2. Continuous Time Conjunctive Bayesian Networks 



In this section, we introduce and describe some of the basic properties of continuous time 
conjunctive Bayesian networks (CT-CBN). These models are continuous time Markov chain 
models on the distributive lattice of a poset. To begin, we review some background material 
from combinatorics. The relevant combinatorial material can be found in introductory sections 



of Beerenwinkel et al. (20061 and more detailed information is covered in Stanley (1999). 



A partially ordered set (poset) is a set P with a transitive relation <. In our models, the set 
P will be a set of genetic events, and the order relation ^ specifies partial information about the 
order in which these events must occur. The relation p ~< q implies that event p happens before 
event q. The distributive lattice of order ideals of P, denoted by J{P), consists of all subsets 
5 C P, that are closed downward, i.e., S £ J{P) if and only if, for all g G 5 and p ^ q, we have 
that p £ S. The order ideals of P correspond to the genotypes (or mutational patterns) that 
are compatible with the order constraints. We refer to G J{P) as the wild type. 
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{1,2,3,4} 

, , X,^Exp(A,) {^'^'^^ 

\/\ ^2~Exp(A2) {1^2} {2,4} 

3 4 X3 ~max(Xi,X2) +Exp(A3) / \ / 

X4 ~ X2 + Exp(A4) {1} {2} 



Poset, P Waiting times, X Lattice of order ideals, J{P) 

Figure 1. Running example. 

Example 2.1. As a running example, consider the poset P on the four element set [4] = 
{1,2,3,4} subject to the order relations 1 ^ 3, 2 -< 3, 2 ^ 4. The distributive lattice J{P) of 
order ideals of P consists of eight elements. Both sets are displayed in Figure [T} 

Let P be a poset with ground set [n]. For each event i £ P, we define a random variable 
Zi ~ Exp(Aj). Then we define the random variables Xi as 

Xi = max {Xj} + Zi, i = 1, . . . ,n. 

jepa(i) 

Here pa(i) is the set of all predecessors of event i in the poset P. The random variable Xi 
describes how long we have to wait until event i occurs assuming that we start at time zero with 
no events. Mutation i cannot occur until all the mutations preceding it in the partial order P 
have occured. The family of joint distributions of X defined in this manner is the continuous 
time conjunctive Bayesian network (CT-CBN). It has state space M"q consisting of vectors of 
waiting times and parameters A = (Ai, . . . , A„) G KI^q. 

The probability density function associated to this model is easy to write down, given the 
recursive conditional nature of the distribution. The density function is 

n 

(1) fp,\{'t) = 11 -^i exp(— Ai(tj — max tj)), if U > max tj for all i G [n], 

~^ iepa(i) i6pa(i) 

and fp^\{t) = otherwise. The CT-CBN is an example of a regular exponential family, with 
minimal sufficient statistic consisting of the vector of time differences {ti — maxjgpa(j) 

One instance of the random variable X is a sequence of times T = (ti, . . . , tn) that satisfy the 
inequality relations implied by P, i.e., ti > max^^pj^^j) tj for all i G [n]. A set of times satisfying 
the indicated inequality constraints is compatible with the poset. Thus, the data for the model 
is a list of sequences of times Ti, . . . , T/v, where N is the number of observations. 

Proposition 2.2. Let P be a poset and Ti, . . . , T/v be a collection of data. If any of the are 

incompatible with the poset P, the maximum likelihood estimate does not exist (the likelihood 
function is identically zero). Otherwise, the maximum likelihood estimate of X is given by 

Z^k=i\'^ki — maxjgpa(j) ikjj 
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Proof. Suppose the data Ti, . . . ,Ti^ are compatible with the poset P. Then from Equation [T| 
the log-likelihood function is 

N n . . 

£(Ai, . . . , A„) = log Ai - \i{tki - max tkj) ■ 

Differentiating with respect to Aj yields the equations 

N 



yZ (t- - (tki - max tkj)] = 0, 

^ V^i i6pa(j) / 



and the claimed formula follows by solving for Aj. □ 

Theorem 2.3. Given data Ti, . . . ,T/v, the maximum likelihood poset is the largest poset that is 
compatible with the data. 

Proof. Let (P^,-<i) and {P'^,^2) be two posets, both compatible with the data, such that P^ 
is a refinement of P^ (that is, every relation that holds in P^ also holds in P^). It suffices to 
show that the likelihood function evaluated at the MLE is larger for P^ than P^, because this 
implies that adding relations compatible with the data increases the likelihood. 

Denote the MLEs for P^ and P^ by A^ and A^, respectively. According to Proposition 
these values are given by 



2.2 



A 



N 



Ylk=i(.'tki — maxjgpaj(j) tk 



It does not change the expression to replace pa;(z) with the set {j G '■ j i in P''}- Since 
P^ has more relations than P^, this implies that the maximum is taken over a strictly larger 
set, and thus Xj > A? for all i. 

However, the log-likelihood function evaluated at A' is 

^ " / ^ ^ \ 
ii{X'\T) = logA^-A^(t,,- max tfc,) 



N 

y2 [n log Xl-XlY^{tki- max tkj, 



5^(iVlogA^-iV). 



i=l 



Since the logarithm is a monotone function, we deduce that ^i(A"'^ | T) > i2{X'^ I ^ 

One of the most interesting quantities that we can compute with respect to the CT-CBN 
is the expected waiting time until a particular pattern S £ J{P) is reached in the course of 
evolution. In other words, assuming that the parameters A are known, we are asking how long 
it takes until a certain collection of genetic events have occurred. The expected waiting time is 



an important measure of genetic progression. Rahnenfiihrer et al. (2005) have shown that, for 



mutagenetic trees, it is a prognostic factor of survival and time to relapse in glioblastoma and 
prostate cancer patients, respectively, even after adjustment for traditional clinical markers. 
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Note that because the exponential distributions are memory less, calculating the waiting time 
from the wild type will also serve for determining the waiting time between any two patterns. 
Furthermore, the nature of the conditional factorization for the joint density of X implies that 
we can restrict attention to the case where S = P, i.e., to determining the waiting time until all 
events have occurred. 

Let S G J{P) be an observable genotype. We define Exit(S') = {j e P \ j ^ S, Su{j} G J(P)} 
to be the set of events that have not occurred in S, but could occur next. For any subset 
T C P, we set At = X^jer ■^i- ^ chain in the distributive lattice J{P) is a collection of subsets 
Co, Ci, . . . ,Ck G J^P) that satisfy Ci C Cj+i for all i, with all containments strict. A chain is 
maximal if it is as long as possible. Note that all maximal chains in the distributive lattice J(-P) 
have length n + 1 with n = \P\ and start with the empty set as Cq and reach the maximum 
at Cn = P- Let C(J(P)) denote the collection of maximal chains in J{P); a typical element is 
denoted C = (Cq, . . . , Cn)- 

Theorem 2.4. The expected waiting time until all events have occurred is given by the expression 



E[maxXi] = Ai • • • A„ Yl H 



C6C(J(P)) 




i=0 ^Exit(C,) J y-^Q ^^^it{C,) 



Before proving Theorem |2.4[ we want to briefly mention the idea of the proof, because it will 
be a common technique for proofs throughout the paper. First of all, the indicated expectation 
involves the integral of a function that depends on maxima, which are not so simple to integrate 
directly. So the first step is to decompose the integral into a sum of integrals over many different 
regions (one for each maximal chain in J{P)). Over these simpler regions, the maximum function 
disappears. Furthermore, these regions are each simplicial cones and the integral can then be 
computed by a simple change of coordinates. 

Proof. Let f{t) be the density function from Equation [T| We must compute 
(2) / maxti- f{t)dt 

Let Sn denote the symmetric group on n letters with a = {ai, . . . a typical element. The 
integral ^ over the positive orthant breaks up as the sum 



taj{t)dt. 



That is, the sum breaks up the integral into smaller integrals over regions 

<t^^ <ta2 < ■■■ <ta„. 

The integrand is zero unless ui, cj2, . . . , (in is a linear extension of the poset P. In other words, 
the integrand is zero unless the sets Ci = U*-^^{(Tj} for i = 0, . . . ,n form a maximal chain in 
the distributive lattice J{P)- So suppose that a is a linear extension of P. Without loss of 
generality, we may suppose that this linear extension is 1 ^ 2 ^ • • • ^ n. 
We must compute the integral 

'oo poo fOO 

/ ••• / tnfit)dt 

tl=0 J t2=ti J tn=tn — l 
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where over this restricted region, /(t) now has the form 

n 

fit) = n exp(-Ai(ti - 



2=1 



where is the largest number with -< i in P. 
Now introduce the change of coordinates 

uo = ti, Ui = ti+i - for i = 1,2, . . . ,n - 1. 

The determinant of this Unear transformation is one, so the integral becomes 

/ / ••• / [uq^ hu„_i)PjAiexp(-Ai(ui_i +'Ui_2 H h 



Uj(i))) du. 



The multiple integral is now over a product domain, and involves a function in product form, 
so we want to break this integral up into the product of integrals. To do this, we must collect 
the Aj terms that go with the various terms. In the exponent, we have that Aj appears as 
a coefficient of if and only \i i > k > This, in turn, implies that when all the events 

1,2, have occurred, all the predecessor events of i have occurred. But this means that 

i G Exit(Cfc), where = {1,2 ... ,k}. Thus, the transformed integral breaks up as a sum of n 
integrals that have the form: 

poo poo poo 

Ai • • • An • / / • • • / exp(-AExit(c,)"i) du. 

J UQ=0 J U\=0 Jun-1=0 j_Q 

The integral is over a product domain of a product function. By elementary integration, it is 

n-l 



Ai • • • A„- Yl T ■ 

^Exit(Cj) ^0 AExit(Ci) 



which completes the proof. □ 

3. Relation to the Discrete Conjunctive Bayesian Network 
In this section, we explore the connection between the CT-CBN and the discrete CBN (D- 



CBN) introduced in Beerenwinkel et al. Part of the motivation for this project was to under- 
stand how the two types of models relate to each other and how structural information from 
one model can be used to deduce information about the other. Also, we are naturally led to 
study discrete models because we rarely have access to the times at which the individual events 
occurred, but can only check, after a certain sampling time, which of the events have occurred. 

We will show that the D-CBN gives a first order approximation to the transition probabilities 
in the CT-CBN. This suggests that the D-CBN is not optimal from a modeling standpoint as 
the nature of our applications is to wait until mutations occur. On the other hand, the D-CBN 
is much simpler to work with, and maximum likelihood estimates for the D-CBN can be used as 
a first step for iterative algorithms for ML estimation in the censored versions of the CT-CBN, 
described in Section 4. 

To explain the first order approximation result, we consider the CT-CBN as a continuous 



time Markov chain on the distributive lattice J{P) (see Norris (19971 for background on Markov 
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chains). The rate matrix for the Markov chain is the upper-triangular m x m matrix Q, where 
m = \J{P)\. The entries of Q are indexed by pairs of sets of occurred events S,T G J{P) and 
are given by 

' Xj if S CT and T\S = {j}, 

Qs,T = I -AExit(5) a S = T, and 
otherwise. 

V 

If we fix a hnear extension of J{P) and order the rows and columns of Q according to this linear 
extension, Q will be an upper triangular matrix. 

Let p{t) be the mxm matrix where, for S*, T G JiP), the entry ps.rit) denotes the probability 
that the continuous time Markov chain with state space J{P), is in state T at time t starting 
from state S at time 0. This quantity can be calculated by integrating the density function from 
the continuous time model. However, it is simpler to calculate from standard theory of Markov 
chains. Indeed, the matrix p{t) is the solution to the system of differential equations 

j^p{t) = Qp{t) 

subject to the initial conditions p(0) = J, the identity matrix. The solution to these first 
order differential equations is obtained by taking p(t) = exp((5t), where exp denotes the matrix 
exponential: 

exp(Qi) = / + Qt+^ + ^ + .... 
Clearly, the solution to the differential equations then satisfies 

so a first order approximation to p(t) is a function p{t) that satisfies 

m = I and j^m\t=i) = Q- 

A first order approximation to the CT-CBN can be derived from the D-CBN as follows. 

Associated to the D-CBN are n parameters 6i, . . . ,9n, where 9i is the conditional probability 
that event i has occurred, given that all its predecessor events have occurred. By setting 9i = 
1 — exp(— Ait) and defining 

ps,T{t)= n n (1-^*)' if-scr, 

ieT\S ieExit(T) 

and ps,T{t) = otherwise, the D-CBN is naturally interpreted as a continuous time model, 
where ps,T{t) is the probability that the D-CBN is in state T at time t given a starting state of 
5 at time 0. 

Proposition 3.1. The model p{t) derived from the D-CBN is a first order approximation to the 
CT-CBN p{t). 

Proof. Note that diagonal entries oip(t) have the form ]^exp(— A^t), and off-diagonal entries are 
either identically zero or are a product of terms, at least one of which is of the form 1— exp(— Ajt). 
This implies that p(0) = /. 
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liT = S, then ps,s{t) = HieExitCS) exp(-Ait). So, by the product rule 

J^PS,s{t) = 5Z -^iPS,s{'t) 
ieExit(S) 

and thus jj.ps,s{t)\t=o = -AExit(5) = Qs,S- H S CT then 

PS,Tit) = n (1 - exp(-Ait)) ■ JJ exp(-Aii) 

ieT\5 jeExit(T) 



and thus 



i(zT\S igExit(T) 



If r \ S* has cardinahty greater than one, then ^ps^T{t)\t=o = = Qs,T, since every term 
in the sum involves at least one expression of the form 1 — exp(— Ajt). On the other hand, if 
r = 5 U {j}, then ^pgx[t)\t=o = Aj = Q,s,t, since only the first term in the sum does not 
contain an expression of the form 1 — exp(— Ajt). As all other entries in p(t) and Q are zero, this 
proves that p{t) is a first order approximation to p{t). □ 

Given that the D-CBN is a first order approximation to the CT-CBN, it seems natural to 
conjecture that these two models are, in fact, equal to each other. Indeed, if P is the poset with 
no relations, it is easy to show that these two models coincide. However, if P contains at least 
one relation, the models are no longer the same, and the D-CBN is not even a second order 
approximation of the CT-CBN. This is illustrated in the following example. 

Example 3.2. Let P be the poset on two elements with one relation 1 -< 2 and fix the natural 
order 0, {!}, {1, 2} in J(P). If Ai ^ A2, then 

e-Ait)(i _ e-'^2*)^ 
p{t) = I e-^2t 1 _ e--^2t 



and 





(1 - e-^i*)e- 


-A2t (1 





g-A2t 












-Alt 


Ai ,'„-A2t 
A1-A2 


e-Ait^ 





e-A2t 













-A2e 



p{t) =1 ' " e-^2t 1 _ e-A2t 



In particular, ^p$,{i}{t)\t=o = -Af - 2A1A2, whereas (Q^)0,{i} = -Af - A1A2. 

The discrepancies exhibited in the previous example become more dramatic as the poset P 
develops longer chains. While the D-CBN is not identical to the CT-CBN, its nice properties 
can be exploited at various points during optimization. 

4. Censoring 

Our goal in this section is to define and analyze a censored CT-CBN model that we will 
apply to genetic data in Section 5. The reason for introducing models with censoring is that we 
rarely know explicitly the time points ti,. . . ,tn at which the events have occurred. Often we 
can only measure, at a particular time, which of the events have occurred so far. It is natural 
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to assume that the observation times are themselves random. For example, the evolutionary 
process leading to drug resistance in HIV starts with the onset of therapy. However, the genome 
of the virus can only be determined after viral rebound, i.e., after loss of viral supression, which 
typically involves several mutations. Thus, the sampling time is the time to therapy failure. 

We introduce a new event s, such that the random variable Xg is an independent, exponentially 
distributed stopping time (or sampling time, or observation time), Xs ~ Exp(As). We define 
a new posct = P U {,s} by adding to the posct of mutations the element s, which has no 
relations with the other elements in P. In this setting, the events we observe consist of subsets 
SC. [n], which correspond to the event that Xi < Xg for all i G 5 and Xg < Xi for all i G [n] \ S. 

Thus, an observed set of events S imposes extra relations i < s iov i ^ S and s ^ i for 
i G [n] \ S" on the poset Pg, and we are led to study the posct refinements of P^. A poset 
Q is said to refine the poset Pg if every relation in Pg also holds in Q. A realization of the 
random vector X is said to be compatible with Q, denoted X \- Q, \i Xi < Xj whenever 
i ^ j in Q. We can directly compute the probability of the event X \- Q in terms of the 
distributive lattices J{Q) and J{Ps)- Throughout this section, we abuse notation and say that 
X ^ Pg ]£ X = {Xg ,Xi,..., Xn) is distributed according the CT-CBN associated to poset Pg 
with parameter vector A = {Xg, Ai, . . . , A„). 

Theorem 4.1. The probability that X ^ Pg is compatible with the poset Q is given by 

n 

(3) Prob(X h g) = A,Ai • • • A„ W^. ' 

CeC(J(Q)) i=0 Exit(Ci) 

where the sum runs over all maximal chains in the distributive lattice J{Q) . 

Remark. Note that in this formula, and in all formulas throughout this section, the expression 
Exit(S') always refers to the underlying poset Pg and not to the refinement Q. 

Proof. We must compute the integral 

/ lQ{t)-f{t)dt 

where II(5(t) is the indicator function of compatibility with the posct Q. The integral breaks up 
into a sum over the linear extensions of the poset Q over regions over the form t^Q < ta^^ < ■ ■ ■ < 
ta„- Without loss of generality and after renaming the elements of the poset Pg, we can assume 
that the linear extension of interest is ^ 1 ^ ■ ■ ■ ^ n. We must calculate the integral 




f{t)dt 



where over the restricted region, the integrand has the form 

n 

fit) = Yl^i exp(-Ai(ii - tj^i))). 

i=0 

Taking the usual change of variables uq = to and Uj+i = i^+i — for i = 1, . . . , n we see that 
the integral becomes 

" />oo 

n / (^W{->^Exit{Ci)Ui)dUi 
i=0 •^"i=0 
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where Ci = {0,1, ... ,i — 1}. This yields the desired contribution to the integral. □ 



Example 4.2. Let P be the poset from Example 2.1 with relations 1 ^ 3, 2 -< 3, and 2^4, 
and consider the extended poset Pg = P L) {s} with no additional relations. Suppose we want 
to calculate the probability of precisely mutations 2 and 4 occurring before measurement. The 
refinement Q2,4 corresponding to the genotype {2,4} is a chain 2^4^s^l^3, and so the 
distributive lattice J(Q2,4) is also a chain. From Equation [s] we see that 

Prob(X h Q2,4) = A1A2A3A4A, ^ ^ 111 



Ai + A2 + As Ai + A4 + Xg Ai + Xg Ai A3 

On the other hand, the distributive lattice J(Qi,2) has four chains and Prob(X h Qi,2) is the 
sum of four terms of product form. The terms in the sum have common factors, and this 
expression can be rewritten as 

p uXhO ) = A1A2A3A4A, f 1 1 V 1 1 

'° ^ ^'''^ (Ai + A2 + A,)(A3 + A4 + A,)(A3 + A4) VA2 + A, A1 + A4 + AJ VA3 A4 

The method for building the expression for this probability recursively is explained in Proposition 
1441 □ 



Given a poset P and a set of parameters Ai, . . . , A„, and A^, we obtain a probability distribu- 
tion p{P, A) C the 2" — 1 dimensional probability simplex. The set of all such probability 
distributions is the discrete censored CT-CBN. Although there are n + 1 parameters specifying 
the model, it is easy to see that the family of probability distributions that can arise has di- 
mension n. The loss of dimensionality arises from the fact that p{P,X) = p{P,tX). This is an 
example of an algebraic statistical model (Pachter and Sturmfels, 2005| ), since the probability 



p{P, X) is a rational function of the parameters A. Unfortunately, these models seem difficult 
to analyze using techniques from algebraic geometry. Indeed, even the model associated to the 
poset with no relations, corresponding to independently occurring mutations, lacks a simple 
description as an algebraic statistical model. 

Unlike in the fully observed CT-CBN or the D-CBN, we have found no general closed form 
expressions for the MLEs of the parameters of the censored CT-CBN. However, as the censored 
model is a marginalization of the CT-CBN and the CT-CBN is a regular exponential family. 



we can use the EM algorithm to find ML estimates (see Chapter 8 of Little and Rubin ( 2002 ) 
for a general description of the EM algorithm) . While the EM algorithm is only guaranteed to 
find a local maximum of the likelihood function, our computational experience has been that 
using exact MLEs from the D-CBN for 6i and solving for Ai in the approximate expression 
6i = Xi/{Xi + As) works as a good starting guess for the EM algorithm. 

In the EM algorithm for a marginalization of a regular exponential family, we start with a 
guess for the ML parameters A*. Then, given the data, we compute the expected values for the 
missing sufficient statistics of the fully observed regular exponential family. In our setting, we 
need to compute, for each S C [n] that is observed, and for each i £ [n], the expected value 

(4) K[Xi - max Xj \ X h Qs]. 

jepa(j) 

This is the E-step of the EM algorithm. The expected sufficient statistics are then used to 
compute MLEs for A in the fully observed CT-CBN. This is the M-step of the EM algorithm. 
The EM-algorithm iterates alternations of the E-step and the M-step. After each iteration, the 
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likelihood function is guaranteed to increase. A fixed point of the EM algorithm must be a 
critical point of the likelihood function. Running the EM-algorithm with many starting points 
is a useful heuristic for calculating maximum likelihood estimates. 

Since the M-step in our EM algorithm is trivial to calculate from Proposition 2.2, the only 
thing remaining to compute is the expected value Q . The formula is similar to the one that 
appears in Theorem |2.4[ 

Theorem 4.3. The expected value of Xi — maxjgpa(j) Xj given that X is compatible with Q is 



E[X,- max XAX h Q] = ^^^i"-^" ^ f] 1 

..pa(, ^ Prob(X h Q) ^^^^^^^ \ U x^^^ ^ 



i{i,Ck) 



,2 ^Exit(C,)^ 



where the first sum is over all maximal chains in J{Q) and 

L{i, Ck) = 



1 if i ^ Ck and pa(i) C Ck 
otherwise. 



Proof. The proof follows the same basic outline of the proof of Theorem 2.4 The expected value 
is 



/ {ti- max tj) -iQit) ■ f{t)dt. 
Jr"+^ jepa(i) 



We can calculate the integral by decomposing it into a sum over the linear extensions of Q, i.e., 
the chains in the distributive lattice J{Q). Without loss of generality, we can suppose that the 
linear extension is called ^ 1 ^ • • • ^ n. For this linear extension, the integral becomes 

{ti-tj(^i))f{t)dt 

'to=0 Jt\=tQ Jtn=tn-\ 

and over this region f{t) has the form 

n 

fit) = Y[ Afcexp(-Afc(tfc - tj{k)))- 

k=0 

Applying the usual change of coordinates, we can rewrite this integral in product form as 
/ / • • • / [ui-i H h Afc exp(-Afc(?Xfc-i + Uk-2 H h du. 

J U0=0 J Ul=0 J Un=0 k = 

Breaking this integral up as a sum, yields a collection of integrals we have already computed in 
the proof of Theorem 2.4 However, each term 

1 \ 1 



n 



contributes to the sum if and only if i G Exit(Cfc), i.e., if and only if i{i, Ck) 



1. 



□ 



Rather than computing the expectation from Theorem 4.3 by explicitly listing all maximal 
chains in the distributive lattice J{Q), the expected value can be computed recursively, by 
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summing up the distributive lattice. This approach is sometimes referred to as dynamic pro- 
gramming. It reduces the computational burden of computing the expectation, because one need 
not enumerate all the chains in J{Q). 

Proposition 4.4. For each S G J{Q) define Ps and E'g by the formulas 

Ps= E , Ps\m 

jeS:S\{j}eJ{Q) V^Exit(5\{j}) ^E^t{S\{j}) J 

subject to the initial conditions P0 = 1 and = 0, where 

q \ r,•^^ _ / 1 ifi^S\{j} and pa(i) OS\ {j} 
H^^ - I otherwise. 

Then Prob(X H Q) = P{s}u[n] and E[Xi - maxj^p^^i) Xj\Xh Q]= £;|^-^^j^]/P|^|u[„] • 

Proof. Both results follow from writing down a closed form for Ps and E^, proving that these 

E'- 

formulas hold inductively, and showing that Pis\u\n] — Prob(X H Q) and p^"*^^"^ = KlXi — 

^ J L J {s}U[n] 

Xj\XhQ]. 

To this end, let Q\s be the induced subposet of Q with element set S. Then 
with P0 = 1. The recurrence 

Ps= E ^ Ps\{,} 

,eS:S\{,}eJ{Q) ^Exit(5\0}) 

is satisfied because every maximal chain in J{Q\s) comes from a maximal chain in exactly one 
of the J{Q\s\{j}) by adding j to the poset Q\s\{j} the last element. Also, P{s}u[n] has the 
desired form. 

Similarly, it is straightforward to show that 

\s\-i ^ \ /|Shl f. ^ . 



fce5 CGC{J{Qs)) 



k=0 '^Exit(Cfc) I \ ■^Exit(Cfc) 
which, together with Ps, above, satisfies the desired recurrence relation. □ 

5. Applications 



In this Section, we use the CT-CBN model to describe the accumulation of mutations in four 
different biological systems (Table [T|. The random variables Xi denote the times of fixation of 
genetic changes in a population of individuals. The definition of a genetic change is different 
in each example, depending on both the nature of the evolutionary process and the technology 
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Biological system Genetic events 



# Samples Ref. 



Prostate cancer 
Colon cancer 
Breast cancer 
HIV drug resistance 



Chromosomal gains and losses 9 54 

Mutated genes 12 35 

Mutated genes 9 42 

Amino acid changes in the HIV RT 7 364 



Rahnenfiihrer et aLI (|2005 1 
Bjoblom ct al. (2005 
Bjoblom c t al.| (|2006" 
Bccrenwinkel etaT (2005a I 



Table 1 . Applications of the CT-CBN model to genetic data. For each biological 
system (first column), the nature (second column) and the number (third column) 
of the genetic alterations, the number of observations (fourth column), and a 
reference pointing to the original study (last column) is shown. 



to detect genetic alterations. For all examples, the data is a list of genotypes that have been 
observed after an unknown sampling time assumed to be exponentially distributed. Thus we 
apply the censored CT-CBN model. Our goal is to learn the structure of mutational pathways, 
which is represented by the linear extensions of the CT-CBN defining posets. 

Theorem |2.3| states that the structure of the ML CT-CBN model is given by the maximal 
poset that is compatible with the observed data. In practice, however, the observations are 
subject to noise, either due to deviations of the data generating process from the model, or due 
to technical limitations in assessing genetic changes. Thus, for most biomedical data sets, the 
ML poset will have very few relations, although a large portion of the observations might support 



more order constraints. We adress this problem following the approach outlined in Beerenwinkel 
let al.l for the discrete CBN. 

Consider a family of posets -Pe (0 < e < 1), each of which is maximal with the property 
that a fraction e of the data is allowed to be incompatible with the poset. We assume that the 
incompatible genotypes are generated with uniform probility = 1/(2" — | J(Pe)|) and consider 
the extended model with probabilities 



Pr(Xe h g I a. A) 



a Pr(X, h Q I A) if Q refines 
(1 — a) Qe else. 



where ~ and a = X^^e J(Pf) %/ Sge2["l % denotes the fraction of the data that are 
compatible with P^. The model can also be interpreted as a mixture model with a the ML 



estimate of the mixing parameter (Beerenwinkel et al. Prop. 8). In the applications, we construct 



several posets Pe for various values of e and select the poset that maximizes the likelihood of 
the extended model. 

In the first application, we analyze data from comparitive genome hybridization (CGH) ex- 
periments. This technique detects large scale genomic alterations, namely the gain or loss of 
chromosome arms, that occur frequently in cancer cells. For example, the event 4q+ denotes 
the gain (-|-) of additional copies of the large {q) arm of chromosome 4. Likewise, 8p— refers to 
the loss (— ) of the small arm (p) of chromosome 8. We consider 54 prostate cancer samples, 
each defined by the presence or absence of the nine alterations 3g-|-, 4g-|-, 6q+, 7q+, 8p—, 8q+, 



lOg— , I'iq+j and Xq+ as defined in Rahnenfiihrer et al. (20051. 



In Figure |2] the log-likelihood is shown as a function of the fraction of incompatible genotypes. 
The poset that maximizes the likelihood explains 89% of the data and is displayed in Figure [s} 
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Fraction of incompatible genotypes, 1 - a 



Figure 2. Maximum likelihood estimation of the discrete censored CT-CBN 
model for the prostate cancer data. The log-likelihood is displayed as a function 
of the fraction of data that is incompatible with the poset. The curve has been 
generated by densely sampling e from the unit interval and estimation of the 
extended models P^. 



\ 

8q+ 6q+ 




13q+ 3q+ 7q+ 8p- lOq- Xq+ 

Figure 3. Optimal prostate cancer poset corresponding to the maximum in 
Figure |2} An arrow p ^ q between two genetic events represents the cover 
relation p ~< q in the Hasse diagram of the poset. 

Four of the nine genetic changes do not obey any relation, two events have one predecessor, and 
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one event occurs only after two parent events have occured. Note that the second best poset is 
the empty poset corresponding to a = 0. 

In our second and third example, we consider mutation data from 35 colon and 42 breast cancer 



tumors, respectively (Sjoblom et al. 20061. Here, a genetic event is an unspecific mutation in a 



gene that has been detected by DNA sequencing. Out of the ~ 200 genes identified by Sjoblom 



et al. (2006) we considered those that were mutated in at least four tumors. For colon cancer, this 



set comprises ADAMTSL3, APC, EPHA3, EPHB6, FBXW7, KRAS, MLL3, OBSCN, PKHDl, 
SMAD4, SYNEl, and TP53, while for breast cancer, we identified ATP8B1, CUBN, FLJ13479, 
FLNB, MACFl, OBSCN, SPTANl, TECTA, and TP 53. 




Fraction of incompatible genotypes, 1 - a 



Figure 4. ML estimation of the CT-CBN model for the colon cancer data. 

The colon cancer ML poset is the empty poset (Figure |4]). By contrast, the maximum like- 
lihood breast cancer poset explains 93% of the observations (Figure [5]) and consists of three 
relations (Figure [6]). Two of them form the conjuction stating that mutations in both the cu- 
bilin gene {CUBN) and the obscurin gene {OBSCN) occur before the f3 filamin gene {FLNB) is 
mutated. The third relation identifies tumor protein p53 ( TP53) as the mutational predecessor 
of the zinc finger protein 668 {FL J 13479). TP53 is mutated in most colon and breast cancer 
tumors and known to occur early in the somatic evolution of many cancers. 
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Figure 5. ML estimation of the CT-CBN model for the breast cancer data. 



CUBN OBSCN TP53 



FLNB FLJ13479 ATP^Bl MACFl SPTANl TECTA 

Figure 6. Optimal breast cancer poset correponding to the maximum in .Figure [5} 



The last application is concerned with the evolution of drug resistance in HIV. We study 
the accumulation of amino acid changes in a segment of the HIV pol gene that codes for the 
viral protein reverse transcriptase (RT). The seven resistance-associated mutations 41L, 67N, 
69D, TOR, 210W, 215Y, and 219Q ( [Johnson et al. 20061 are considered, where, for example, 
41L indicates the presence of the amino acid leucine (L) at position 41 of the RT. A total of 
364 viruses are analyzed that have been isolated from infected patients under therapy with 
zidovudine, an antiretroviral drug targeting the RT. Amino acid changes have been inferred 
after DNA sequencing of the pol gene. 

The optimal poset for the HIV drug resistance data explains 87% of the observations (Fig- 
ure [7| . Its Hasse diagram has two connected components (Figure [8| . The first one represents 
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Figure 7. ML estimation of the CT-CBN model for the HIV drug resistance data. 



215Y 

i 

41L 

i 

2WW 



70R 

Q7N 219Q 
69D 



Figure 8. Optimal HIV drug resistance poset correponding to the maximum in .Figure [7| 



the linear pathway 215Y -< 41L -< 210W, whereas in the second component mutations 70R, 67N, 
219Q, and 69D, form a rhombus beginning with 70R and ending with 69D. This finding confirms 



previous studies in which the same clustering of mutations has been described (Beerenwinkel 



et al. , 2005a[ Boucher et al. 



1992 Larder and Kemp 19891. The two groups of mutations are 
often referred to as the "215-41 pathway" and the "70-219 pathway", respectively. They provide 
alternative (but not exclusive) routes to resistance for HIV. The CT-CBN model captures this 
escape behavior and suggests order constraints within each group. 
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In all applications discussed here, there are several posets with near-optimal performance. 
Throughout we find these posets to be very similar to the optimal set of relations. For example, 
the second best HIV drug resistance poset explains 93% of the data and it differs from the 
optimal one in Figure [8] only in the second component which consists of the two relations TOR 
-< 219Q and TOR -< 69D. Using cross-validation techniques the variation in poset selection or in 
the selection of individual relations can be studied and may be used to select more robust poset 
structures. 

For each of the four examples, the best mutagenetic trees show inferior performance as com- 
pared to the CT-CBN posets. For example, the mutagenetic tree for prostate cancer found in 



Rahnenfiihrer et al. (2005 Fig. 2) explains only 56% of the data at a log- likelihood of only 
—248.4. Unlike mutagenetic trees, CT-CBNs can model the requirement of multiple parent mu- 
tations. This type of order constraint was found in three of the four applications considered 
here. 



6. Discussion 



Conjuctive Bayesian networks are statistical models for the accumulation of mutations. They 
are defined by a poset of mutations, which encodes constraints on the order in which mutations 
can occur. Here we have introduced the CT-CBN, a continuous time version of this model in 
which each mutation appears after an exponentially distributed waiting time, provided that 
all predecessor mutations in the poset have already occurred. In an evolutionary process, this 
waiting time includes the generation of the mutation plus the time it takes for the allele to reach 
a frequency in the population that allows for its detection. Since we consider only mutations 
with a selective advantage, the waiting time will be dominated by the mutation process in large 
populations and by the fixation process in small populations. However, the model does not 
make any assumptions about the underlying population dynamics. Hence the waiting time is 
the time to detection of the mutation, which depends on the technology used for measurement. 
For example, population sequencing of HIV samples can detect mutations with a frequency of 
at least 20%. 

The CT-CBN informs about the order in which mutations tend to occur. For a set of n 
mutations, the number of possible pathways to evolve the wild type into the type harboring all 
n mutations is the number of linear extensions of the poset. This cardinality increases rapidly 
with n, but is hard to compute exactly (Brightwell and Winkler 19911 . However, evolution 



appears to follow only very few mutational paths to fitter proteins (Weinreich et al. 20061 



Thus, we generally expect to find posets with much smaller lattices of order ideals than the full 
Boolean lattice. Evolution takes place in this reduced genotype space and can be modeled more 
efficiently. 

Estimation of the genetic event poset from observed data helps understanding the phenotypic 
changes and biological mechanisms responsible for the fitness advantage. Furthermore, the poset 
allows for identifying early and essential mutational steps that may be predictive of clinical 
outcome or point to promising drug targets. In cancer research, Fearon and Vogelstein (1990) 



have proposed linear pathways of genetic alterations as a model of tumorigenesis. These models 



are known as "Vogelgrams" today (Gatenby and Maini 20031. The CT-CBN can be regarded 



as a generalization of the Vogelgram that is equipped with a statistical methodology for model 
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selection and parameter estimation. In particular, the CT-CBN allows for multiple evolutionary 
pathways and makes explicit the timeline for the genetic alterations. 

We have derived equations for the ML estimates of the model parameters and for the expected 
waiting time of any genotype. These results are used in the EM algorithm for parameter esti- 
mation in the censored model. Censoring is modeled by assuming an exponentially distributed 
sampling time of the observed genotypes. This model appears most relevant for the data sets 
available, which often comprise cross-sectional data sampled after different but unknown time 
periods w.r.t. the evolutionary process. Other censoring schemes might be applicable in the 
future and could also be worked out. For example, the sampling time (but not the time of 
appearance of each mutation) can be observed in some situations, giving rise to a different 
marginalization of the fully observed CT-CBN. 

Since model selection relies on a simple combinatorial criterion, and the number of model 
parameters is only linear in the number of mutations, we expect the CT-CBN to scale well with 
increasing data sets both in the number of observations and the number of mutations. In the 
cancer and HIV applications presented here, there are between 7 and 12 genetic events and 
35 to 364 observations. It is likely, however, that the number of genes associated with cancer 



progression, for example, is much higher than currently known (Sjoblom et al. 2006). The 



running time of the EM algorithm is dominated by the size of the distributive lattice of order 
ideals. Thus, many mutations can be modeled as long as the number of mutational pathways is 
limited. 
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